Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

RNA-Seq Data Analysis ◾ 169

To keep the files organized, we will create two subdirectories in our project directory: “refgenome”, where we will store the reference genome, and “gtf”, where we will save the GTF annotation file.

The sequences of reference genomes and their annotations are available in many sequence databases such as Ensembl, UCSC, and the NCBI Genome database. iGenomes, built by Illumina, has simplified downloading the reference data for frequently analyzed organisms: genome builds in FASTA format and their annotations in GTF/GFF files from the above major databases are available for download. The iGenomes website that includes the download links is available at “https://support.illumina.com/sequencing/sequencing_software/igenome.html”. Reference data can also be downloaded from “https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/”. For aligning with STAR, we will download the UCSC human reference genome sequence in FASTA format and the gene annotation in a GTF file because the UCSC files identify chromosomes by names (for example, chr1) rather than accession numbers. While you are in the main directory “rnaseq”, run the following bash script to create the subdirectories and to download the human reference genome and its gene annotation:

# Create the reference genome directory and download the UCSC hg38 FASTA
mkdir -p refgenome
wget \
  -O "refgenome/hg38.fa.gz" \
  "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz"
gzip -d refgenome/hg38.fa.gz

# Create the annotation directory and download the NCBI RefSeq GTF
mkdir -p gtf
wget \
  -O "gtf/hg38.ncbiRefSeq.gtf.gz" \
  "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz"
gzip -d gtf/hg38.ncbiRefSeq.gtf.gz
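
Because the choice of the UCSC files rests on the chromosome names, it can be worth confirming that the names in the GTF file also occur in the FASTA file before building the index. The following is a minimal sketch of such a check, not part of the original workflow; the function name “missing_in_fasta” is hypothetical, and it assumes an uncompressed FASTA file and a tab-separated GTF file whose first column holds the sequence name.

```shell
# Hypothetical helper: print each sequence name that appears in column 1 of a
# GTF file but has no matching header in a FASTA file.
# Usage: missing_in_fasta <genome.fa> <annotation.gtf>
missing_in_fasta() {
    awk -F '\t' '
        # First file (FASTA): remember the name that follows ">" on header lines
        NR == FNR {
            if ($0 ~ /^>/) {
                split(substr($0, 2), parts, /[ \t]/)
                seen[parts[1]] = 1
            }
            next
        }
        # Second file (GTF): skip comments, report each unknown name once
        !/^#/ && NF > 0 && !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}
```

Running “missing_in_fasta refgenome/hg38.fa gtf/hg38.ncbiRefSeq.gtf” prints nothing when every sequence name in the annotation is present in the genome; any names it prints would need to be investigated before indexing.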

Mapping reads to a reference genome using STAR is a two-step process: creating a reference sequence index and then mapping reads to the reference sequence.

The following command creates the STAR index for the reference genome sequence. The “--runThreadN” option specifies the number of threads to use, “--runMode genomeGenerate” tells STAR to build an index rather than map reads, “--genomeDir” specifies the directory where the index files will be saved, “--genomeFastaFiles” and “--sjdbGTFfile” specify the paths to the reference genome file and the annotation file, respectively, and “--sjdbOverhang” specifies the read length minus one.

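# Create the index directory and build the STAR genome index
mkdir -p indexes
# --sjdbOverhang is the read length minus one; 150 corresponds to 151-base reads
STAR --runThreadN 4 \
  --runMode genomeGenerate \
  --genomeDir indexes \
  --genomeFastaFiles refgenome/hg38.fa \
  --sjdbGTFfile gtf/hg38.ncbiRefSeq.gtf \
  --sjdbOverhang 150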
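
Since “--sjdbOverhang” depends on the read length, the value can be derived from the reads themselves instead of being typed by hand. The sketch below is one possible way to do this and is not taken from the original text; the function name “sjdb_overhang” is hypothetical, and it assumes standard four-line FASTQ records, using the longest read when lengths vary.

```shell
# Hypothetical helper: print (longest read length - 1) for FASTQ input.
# Reads the file given as an argument, or standard input when none is given.
# Assumes four-line FASTQ records, with the sequence on line 2 of each record.
sjdb_overhang() {
    awk 'NR % 4 == 2 { n = length($0); if (n > max) max = n }
         END { if (max > 0) print max - 1 }' "$@"
}
```

For a compressed file, the reads can be piped in, for example “gzip -dc sample_R1.fastq.gz | sjdb_overhang”, where the file name is only a placeholder. The printed number is the value to pass to “--sjdbOverhang”.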